-
Notifications
You must be signed in to change notification settings - Fork 253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(wren-ai-service): sql pairs #961
base: main
Are you sure you want to change the base?
Conversation
0d5aa5d
to
bf2515b
Compare
1f952c1
to
3142348
Compare
3142348
to
63999d4
Compare
- name: sql_pairs_preparation | ||
document_store: qdrant | ||
embedder: openai_embedder.text-embedding-3-large | ||
llm: litellm_llm.gpt-4o-mini |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about rename it to sql_pairs_indexing? i think it will be more suitable style as other
docker/config.example.yaml
Outdated
- document_store: qdrant | ||
embedder: openai_embedder.text-embedding-3-large | ||
llm: litellm_llm.gpt-4o-mini | ||
name: sql_pairs_preparation | ||
- document_store: qdrant | ||
embedder: openai_embedder.text-embedding-3-large | ||
name: sql_pairs_deletion | ||
- document_store: qdrant | ||
embedder: openai_embedder.text-embedding-3-large | ||
name: sql_pairs_retrieval | ||
llm: litellm_llm.gpt-4o-mini |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would you sort the pipe name on top? it will be better for us
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
def __init__(self, sql_pairs_store: DocumentStore) -> None: | ||
self._sql_pairs_store = sql_pairs_store | ||
|
||
@component.output_types(documents=List[Document]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious about the output type. will it return a list of doc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
def visualize(self, sql_pair_ids: List[str], id: Optional[str] = None) -> None: | ||
destination = "outputs/pipelines/indexing" | ||
if not Path(destination).exists(): | ||
Path(destination).mkdir(parents=True, exist_ok=True) | ||
|
||
self._pipe.visualize_execution( | ||
["delete_sql_pairs"], | ||
output_file_path=f"{destination}/sql_pairs_deletion.dot", | ||
inputs={ | ||
"sql_pair_ids": sql_pair_ids, | ||
"id": id or "", | ||
**self._components, | ||
}, | ||
show_legend=True, | ||
orient="LR", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gently remind, remove this method, tks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
@observe(name="SQL Pairs Deletion") | ||
async def run( | ||
self, sql_pair_ids: List[str], id: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm considering about using project_id instead of id for the naming more purposeful. wdyt?
""" | ||
|
||
|
||
@router.post("/sql-pairs-preparations") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about removing -preparations
? i think it is a bit redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
return DeleteSqlPairsResponse(sql_pairs_preparation_id=id) | ||
|
||
|
||
@router.get("/sql-pairs-preparations/{sql_pairs_preparation_id}/status") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we could remove /status
to make it more concise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you remove this test file? it seems to be duplicated with test_historical_question.
from src.pipelines.generation import intent_classification, sql_generation, sql_summary | ||
from src.pipelines.retrieval import historical_question, retrieval | ||
from src.pipelines.retrieval import historical_question_retrieval, retrieval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we could directly access from generation and retrieval module. could you also modify the import?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -14,7 +14,7 @@ | |||
sql_correction, | |||
sql_generation, | |||
) | |||
from src.pipelines.retrieval import historical_question, retrieval | |||
from src.pipelines.retrieval import historical_question_retrieval, retrieval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as the previous comment, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
WalkthroughThis pull request introduces significant enhancements to the Wren AI service, focusing on SQL pairs management, pipeline restructuring, and removal of timer-related functionality. The changes include new pipelines for SQL pairs preparation, deletion, and retrieval, a consolidation of module imports, and the removal of the Changes
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
💤 Files with no reviewable changes (1)
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 9
🧹 Nitpick comments (41)
wren-ai-service/src/web/v1/services/chart_adjustment.py (1)
9-9
: Confirmed import replacement fromasync_timer
totrace_metadata
.Previously, the
async_timer
import was used for timing. Now you’ve replaced it withtrace_metadata
, which focuses on tracing rather than capturing performance metrics. This is okay if timing is no longer needed. If performance measurement or timing is still required, consider adding an alternative.wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py (6)
1-16
: Ensure consistent logging strategy for debugging and production.
While the imports and logging setup appear fine, consider whetherlogger.setLevel(...)
or additional contextual logs are needed to help debug issues in production, especially given the asynchronous nature of this pipeline.
18-34
: Enhance documentation for OutputFormatter.
This class is straightforward, but consider adding a docstring or clarifying comment explaining how the output structure is intended to be consumed by downstream processes.
61-82
: Ensure filter conditions align with Qdrant indexing.
Ifid
is empty or None, thefilters
object is null. Confirm that an unfiltered retrieval is appropriate. This could cause performance/memory issues when the index is large.
84-90
: Parameterize score thresholds in filtered_documents.
The filter usesscore=0.7
as a hard-coded threshold. Consider making it configurable, possibly from environment variables or pipeline settings, to allow more flexible tuning.
105-127
: Add docstring for the SqlPairsRetrieval class.
This class orchestrates the retrieval pipeline. A brief docstring explaining the main usage, inputs, outputs, and expected environment would help future maintainers.
141-148
: Remove or secure the main block in production.
This main block is useful for local testing but may not be intended for production. Consider guarding it with anif DEBUG:
check or removing it in deployment contexts.wren-ai-service/src/web/v1/services/sql_pairs_preparation.py (6)
1-12
: Consistently handle exceptions in service modules.
While logging is present, consider a centralized error-handling mechanism or standardized exception classes for consistent error responses across all service modules.
34-37
: Validate the necessity of a dedicated response model.
SqlPairsPreparationResponse
only has a single field. Consider if it could be combined with the request or a more general wrap.
39-55
: Service layering for deletion.
DeleteSqlPairsRequest
andDeleteSqlPairsResponse
mirror the prepare request/response. Validate usage to avoid duplication. If differences are minimal, a shared model might be more maintainable.
62-69
: Assess error details in SqlPairsPreparationStatusResponse.
SqlPairsPreparationError
uses a very general codeOTHERS
. If more error categories exist or are needed, consider enumerating them for clearer telemetry and troubleshooting.
71-81
: Ensure TTLCache suits concurrency and memory constraints.
The cache is set to a potentially largemaxsize=1_000_000
. Confirm that this usage pattern doesn’t cause memory issues under high load.
169-188
: Distinct logging for “id not found” in get_prepare_sql_pairs_status.
Currently logs as anexception
. Consider logging at a warning or info level if it’s expected that some requests might query missing IDs.wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py (6)
1-21
: Add docstrings or references for domain usage.
The imports and usage are not fully explained in the code. For maintainability, clarify in docstrings how these providers and classes fit into the broader SQL pairs preparation architecture.
80-87
: Document usage of prompts function.
This function transforms raw SQL pairs into prompt dicts. A short docstring explaining the rationale for theprompt_builder
would help future maintainers.
89-99
: Parallel tasks with gather: handle partial failures.
If one generation task fails,asyncio.gather()
raises, discarding all results. Consider usingreturn_exceptions=True
or a fallback mechanism if partial progress is acceptable.
126-134
: Check embedding context.
If the documents are large, consider chunking or partial embedding strategies. Large documents can cause memory or performance bottlenecks.
157-160
: Expand on field usage in SqlIntentionGenerationResult.
Currently, onlyintention
is stored. If more fields are needed later (e.g., potential error code, data lineage), consider extending this model now for future-proofing.
173-217
: Centralize pipeline component initialization.
SqlPairsPreparation
includes numerous components in_components
. If more logic is added, consider splitting into separate factory methods or classes to maintain clarity.wren-ai-service/src/globals.py (13)
9-9
: Optimize import style.
While wildcard imports can be convenient, consider using explicit imports for clarity and to avoid naming conflicts, especially in large codebases.
20-21
: Consistent naming for new services.
SqlExplanationService
andSqlPairsPreparationService
follow a different naming style from older classes. Confirm naming guidelines to keep the codebase consistent.
39-41
: Documentation for newly introduced class members.
sql_explanation_service
,sql_regeneration_service
, andsql_pairs_preparation_service
are newly added. Add docstrings describing their purpose.
61-61
: Track semantic version changes for generation pipeline.
Modules likegeneration.SemanticsDescription
are newly integrated. If new features are introduced, reflect that in service versioning or changelogs.
84-87
: Standardize specialized pipeline references.
intent_classification
anddata_assistance
are grouped here. Confirm these belong in generation pipelines. Possibly rename them to reflect domain usage.
100-102
: Assess performance implications of sql_pairs_retrieval.
It’s newly introduced, and retrieving large volumes of pairs can be memory-intensive. Evaluate the need for pagination or partial retrieval.
119-122
: Guard or configure chart generation.
Chart generation can be resource-hungry. If runtime constraints exist, consider gating or caching repeated chart generation requests.
130-133
: Enforce consistent naming between chart_generation and chart_adjustment.
One pipeline ischart_generation
, the other ischart_adjustment
. Validate that this difference is intentional and documented.
152-155
: Docstring improvement for ask_details_service.
Explain howsql_breakdown
andsql_summary
pipelines collectively handle the request.
174-180
: Consolidate naming for sql_summary.
sql_summary
is used in multiple services. If they differ in usage or logic, clarify via better naming or shared references.
182-188
: Potential to unify explanation logic.
sql_explanation_service
might overlap partially withsql_summary
. If so, consider unifying or layering them to reduce duplication.
190-198
: Document advanced usage for sql_regeneration_service.
Regeneration can be complex. Provide docstrings or references regarding the logic behind regenerating queries.
221-231
: Integration tests for SQL pairs prep.
Before finalizing, ensure integration tests are available that callsql_pairs_preparation_service
end to end, verifying typical and edge use cases (e.g., empty requests, large-scale requests).wren-ai-service/src/pipelines/retrieval/__init__.py (1)
1-5
: Import clarity for new retrieval classes.
You have introduced five new imports here. Ensure each class is truly needed by external modules and that no cyclical dependencies occur.wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py (1)
59-67
: Dry run invocation
Providing a main block for quick debugging or demonstration is helpful. Any advanced dry-run logic (e.g., fetching test data) could be added if needed.wren-ai-service/src/pipelines/retrieval/historical_question_retrieval.py (1)
89-89
: [Parameterization suggestion] The threshold value 0.9 might be preferable as a configurable setting.
Hardcodingscore=0.9
may hamper flexibility if different retrieval contexts demand varying thresholds. Consider passing this as a parameter or configuring it externally for better adaptability.wren-ai-service/src/pipelines/generation/sql_generation.py (1)
58-62
: Include sanitization or validation for sample data.These lines introduce a template section for SQL samples. Consider validating or sanitizing the
sample.summary
andsample.sql
fields to avoid potential injection or formatting issues in downstream usage.wren-ai-service/src/web/v1/services/ask.py (1)
270-276
: Newsql_pairs_retrieval
pipeline usage.
The introduction ofsql_pairs_retrieval
is coherent with the PR’s objective to manage SQL pairs. Ensure you handle empty results or errors consistently in the pipeline.docker/config.example.yaml (1)
113-113
: Fix trailing whitespace.Remove the trailing space after
openai_embedder.text-embedding-3-large
.- embedder: openai_embedder.text-embedding-3-large + embedder: openai_embedder.text-embedding-3-large🧰 Tools
🪛 yamllint (1.35.1)
[error] 113-113: trailing spaces
(trailing-spaces)
wren-ai-service/src/pipelines/indexing/__init__.py (1)
90-110
: Consider handling the return value from delete_documents.The implementation looks good, but the return value from
delete_documents
might contain useful information about the number of documents deleted. Consider either:
- Removing the explicit return statement since it's not needed, or
- Returning the result to the caller for potential logging or verification.
- return await self._sql_pairs_store.delete_documents(filters) + result = await self._sql_pairs_store.delete_documents(filters) + logger.info(f"Deleted {result} SQL pair documents") + return resultwren-ai-service/tools/config/config.example.yaml (1)
131-131
: Fix trailing whitespace.Remove the trailing space after
openai_embedder.text-embedding-3-large
.- embedder: openai_embedder.text-embedding-3-large + embedder: openai_embedder.text-embedding-3-large🧰 Tools
🪛 yamllint (1.35.1)
[error] 131-131: trailing spaces
(trailing-spaces)
wren-ai-service/tools/config/config.full.yaml (1)
150-150
: Remove trailing whitespaceThere is a trailing space after
openai_embedder.text-embedding-3-large
.- embedder: openai_embedder.text-embedding-3-large + embedder: openai_embedder.text-embedding-3-large🧰 Tools
🪛 yamllint (1.35.1)
[error] 150-150: trailing spaces
(trailing-spaces)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (65)
deployment/kustomizations/base/cm.yaml
(2 hunks)docker/config.example.yaml
(1 hunks)wren-ai-service/README.md
(0 hunks)wren-ai-service/docs/configuration.md
(0 hunks)wren-ai-service/eval/evaluation.py
(1 hunks)wren-ai-service/eval/pipelines.py
(2 hunks)wren-ai-service/src/config.py
(0 hunks)wren-ai-service/src/globals.py
(8 hunks)wren-ai-service/src/pipelines/generation/__init__.py
(1 hunks)wren-ai-service/src/pipelines/generation/chart_adjustment.py
(1 hunks)wren-ai-service/src/pipelines/generation/chart_generation.py
(0 hunks)wren-ai-service/src/pipelines/generation/data_assistance.py
(0 hunks)wren-ai-service/src/pipelines/generation/followup_sql_generation.py
(6 hunks)wren-ai-service/src/pipelines/generation/intent_classification.py
(0 hunks)wren-ai-service/src/pipelines/generation/sql_answer.py
(0 hunks)wren-ai-service/src/pipelines/generation/sql_breakdown.py
(0 hunks)wren-ai-service/src/pipelines/generation/sql_correction.py
(0 hunks)wren-ai-service/src/pipelines/generation/sql_expansion.py
(0 hunks)wren-ai-service/src/pipelines/generation/sql_explanation.py
(0 hunks)wren-ai-service/src/pipelines/generation/sql_generation.py
(4 hunks)wren-ai-service/src/pipelines/generation/sql_regeneration.py
(0 hunks)wren-ai-service/src/pipelines/generation/sql_summary.py
(0 hunks)wren-ai-service/src/pipelines/indexing/__init__.py
(1 hunks)wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py
(1 hunks)wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py
(1 hunks)wren-ai-service/src/pipelines/retrieval/__init__.py
(1 hunks)wren-ai-service/src/pipelines/retrieval/historical_question_retrieval.py
(5 hunks)wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py
(0 hunks)wren-ai-service/src/pipelines/retrieval/retrieval.py
(0 hunks)wren-ai-service/src/pipelines/retrieval/sql_executor.py
(0 hunks)wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py
(1 hunks)wren-ai-service/src/utils.py
(0 hunks)wren-ai-service/src/web/development.py
(0 hunks)wren-ai-service/src/web/v1/routers/__init__.py
(2 hunks)wren-ai-service/src/web/v1/routers/ask.py
(0 hunks)wren-ai-service/src/web/v1/routers/ask_details.py
(0 hunks)wren-ai-service/src/web/v1/routers/semantics_preparation.py
(0 hunks)wren-ai-service/src/web/v1/routers/sql_answers.py
(0 hunks)wren-ai-service/src/web/v1/routers/sql_expansions.py
(0 hunks)wren-ai-service/src/web/v1/routers/sql_explanations.py
(0 hunks)wren-ai-service/src/web/v1/routers/sql_pairs_preparation.py
(1 hunks)wren-ai-service/src/web/v1/routers/sql_regenerations.py
(0 hunks)wren-ai-service/src/web/v1/services/ask.py
(5 hunks)wren-ai-service/src/web/v1/services/ask_details.py
(1 hunks)wren-ai-service/src/web/v1/services/chart.py
(1 hunks)wren-ai-service/src/web/v1/services/chart_adjustment.py
(1 hunks)wren-ai-service/src/web/v1/services/semantics_preparation.py
(0 hunks)wren-ai-service/src/web/v1/services/sql_answer.py
(1 hunks)wren-ai-service/src/web/v1/services/sql_expansion.py
(1 hunks)wren-ai-service/src/web/v1/services/sql_explanation.py
(1 hunks)wren-ai-service/src/web/v1/services/sql_pairs_preparation.py
(1 hunks)wren-ai-service/src/web/v1/services/sql_regeneration.py
(1 hunks)wren-ai-service/tests/data/config.test.yaml
(1 hunks)wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py
(1 hunks)wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_preparation.py
(1 hunks)wren-ai-service/tests/pytest/services/mocks.py
(4 hunks)wren-ai-service/tests/pytest/services/test_ask.py
(3 hunks)wren-ai-service/tests/pytest/services/test_ask_details.py
(2 hunks)wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py
(1 hunks)wren-ai-service/tests/pytest/test_config.py
(0 hunks)wren-ai-service/tests/pytest/test_main.py
(1 hunks)wren-ai-service/tests/pytest/test_utils.py
(1 hunks)wren-ai-service/tools/config/config.example.yaml
(1 hunks)wren-ai-service/tools/config/config.full.yaml
(1 hunks)wren-ui/README.md
(0 hunks)
💤 Files with no reviewable changes (28)
- wren-ai-service/src/config.py
- wren-ui/README.md
- wren-ai-service/tests/pytest/test_config.py
- wren-ai-service/src/web/v1/routers/sql_explanations.py
- wren-ai-service/src/web/v1/routers/ask_details.py
- wren-ai-service/src/web/v1/services/semantics_preparation.py
- wren-ai-service/README.md
- wren-ai-service/docs/configuration.md
- wren-ai-service/src/web/v1/routers/sql_answers.py
- wren-ai-service/src/web/development.py
- wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py
- wren-ai-service/src/web/v1/routers/sql_regenerations.py
- wren-ai-service/src/pipelines/generation/data_assistance.py
- wren-ai-service/src/pipelines/retrieval/sql_executor.py
- wren-ai-service/src/pipelines/generation/sql_breakdown.py
- wren-ai-service/src/pipelines/generation/sql_answer.py
- wren-ai-service/src/web/v1/routers/sql_expansions.py
- wren-ai-service/src/utils.py
- wren-ai-service/src/web/v1/routers/ask.py
- wren-ai-service/src/pipelines/generation/sql_expansion.py
- wren-ai-service/src/web/v1/routers/semantics_preparation.py
- wren-ai-service/src/pipelines/generation/sql_correction.py
- wren-ai-service/src/pipelines/generation/intent_classification.py
- wren-ai-service/src/pipelines/generation/sql_summary.py
- wren-ai-service/src/pipelines/generation/sql_explanation.py
- wren-ai-service/src/pipelines/generation/sql_regeneration.py
- wren-ai-service/src/pipelines/generation/chart_generation.py
- wren-ai-service/src/pipelines/retrieval/retrieval.py
🧰 Additional context used
🪛 yamllint (1.35.1)
deployment/kustomizations/base/cm.yaml
[error] 132-132: syntax error: could not find expected ':'
(syntax)
wren-ai-service/tools/config/config.full.yaml
[error] 150-150: trailing spaces
(trailing-spaces)
docker/config.example.yaml
[error] 113-113: trailing spaces
(trailing-spaces)
wren-ai-service/tools/config/config.example.yaml
[error] 131-131: trailing spaces
(trailing-spaces)
🔇 Additional comments (90)
wren-ai-service/src/pipelines/generation/followup_sql_generation.py (6)
3-3
: Import statement looks good.
The newly imported Dict
is now utilized for the sql_samples
parameter. This aligns well with the code's usage of typed dictionaries in the subsequent changes.
107-116
: Check for missing or invalid data in template rendering.
You’re conditionally injecting {{sample.summary}}
and {{sample.sql}}
within the prompt. It might be worth adding validation or at least a fallback if summary
or sql
fields are missing or None
. This ensures the rendered prompt doesn't break if partial data is passed.
133-133
: Optional parameter addition is consistent with Python 3.10+.
Declaring sql_samples: List[Dict] | None = None
requires Python 3.10 or above (for the union operator “|”). Ensure the runtime environment is updated accordingly if it isn’t already.
145-145
: Parameter forwarding to prompt_builder.run()
is valid.
Forwarding sql_samples
here is consistent with the prompt function’s updated signature and ensures that sql_samples
are utilized in the Jinja template.
220-220
: Extended method signature for run()
.
Adding the sql_samples
parameter broadens the function’s scope to incorporate SQL examples. Verify that all upstream invocations provide or handle this parameter properly.
232-232
: Ensure consistent usage of sql_samples
in downstream pipelines.
The dictionary key sql_samples
is correctly added. Verify that any follow-up or third-party pipeline steps expecting additional data still function as intended.
wren-ai-service/src/web/v1/services/chart_adjustment.py (1)
Line range hint 42-65
: Double-check for potential gaps left by removing the @async_timer
decorator.
The removal of the @async_timer
decorator might impact any monitoring, performance analysis, or logging processes that previously relied on it. If there's still a need for execution time tracking, ensure that you have another suitable mechanism in place or confirm that timing is no longer a project requirement.
wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py (4)
36-51
: Validate filters logic in count_documents.
When id
is None, filters are not applied. Ensure that this is intended and that it won’t return documents from unintended projects. If project isolation is crucial, you may want to handle the None
case differently.
53-59
: Check for empty embeddings usage.
If count_documents
is zero, the function returns {}
. Validate that an empty embedding dict does not lead to unexpected behavior in later steps, e.g., silent failures or bypassed logic.
92-100
: Handle potential empty outputs in formatted_output.
When filtered_documents
is empty or doesn’t contain the key "documents"
, the function returns an empty list by default. Verify the calling code properly handles the scenario of zero documents.
129-139
: Confirm resilience in the run method.
The pipeline depends on external components (_components
). If any are improperly initialized or fail during execution, confirm that the pipeline either retries or raises a clear exception.
wren-ai-service/src/web/v1/services/sql_pairs_preparation.py (3)
20-34
: Ensure each SqlPairsPreparationRequest has a unique query_id.
The _query_id
is optional but is used in status tracking. An accidental reuse of query_id may overwrite a previous request’s status.
57-60
: Consider input validation for sql_pairs_preparation_id.
No validation is performed on sql_pairs_preparation_id
. Add checks for format or allowed characters, especially if used as a key in caches or logs.
83-125
: Add concurrency or partial-completion checks in prepare_sql_pairs.
The method updates the cache status on success/failure. If multiple calls with the same query_id occur concurrently, the status might fluctuate unexpectedly. Consider locking or unique IDs.
wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py (5)
51-78
: Consider adding validation or a fallback for incomplete fields in the run method (SqlPairsDescriptionConverter).
If sql_pair
data is missing keys, .get()
returns None
. Confirm that id
or sql
is always present to avoid empty documents or indexing errors.
117-124
: Input validation for convert_sql_pairs_to_documents.
Similar to the SqlPairsDescriptionConverter
, ensure fields are present and typed correctly.
136-147
: Confirm the rationale for calling delete_sql_pairs within the same pipeline.
delete_sql_pairs
references sql_pairs_cleaner
. If these pairs are also needed for embedding or further processing, ensure the cleaning logic is indeed desired here.
162-171
: Monitor JSON schema validity.
Ensure the defined JSON schema in SQL_INTENTION_GENERATION_MODEL_KWARGS
remains in sync if additional fields are added to SqlIntentionGenerationResult
.
219-226
: Validate the final pipeline run for empty input.
If sql_pairs=[]
, the pipeline might skip certain steps. Confirm that this path is tested and returns a meaningful response to the caller.
wren-ai-service/src/globals.py (5)
96-99
: Check historical_question_retrieval usage.
Renaming historical_question
to historical_question_retrieval
might require changes in tests or data references. Verify consistent usage across the codebase.
105-108
: Check completeness of sql_generation and sql_correction logic.
These pipelines can be interdependent. If one fails, confirm the fallback or error route is properly handled.
141-144
: Potential concurrency checks in preprocess_sql_data.
If preprocess_sql_data
fetches or transforms large data sets, confirm concurrency usage is safe.
168-171
: Remove or confirm necessity of sql_correction in expansion.
sql_expansion
sometimes duplicates logic found in sql_correction
. Confirm there’s no overlap or repeated overhead.
Line range hint 206-215
: Verify that retrieval pipelines are correctly reused.
question_recommendation
reuses database retrieval logic. Confirm that the pipeline handles different search contexts without collisions.
wren-ai-service/src/pipelines/retrieval/__init__.py (1)
7-13
: Use the all list consistently.
Explicitly listing public exports in __all__
is a good practice. Periodically check if new classes need to be added or removed as the module evolves.
wren-ai-service/src/web/v1/routers/__init__.py (4)
11-11
: Use of singular naming improves clarity
Renaming semantics_preparations
to semantics_preparation
provides a clearer singular concept. This is consistent with best practices for naming modules and routes.
15-15
: New router inclusion for SQL Pairs Preparation
Great addition of sql_pairs_preparation
. This aligns with the PR objective of expanding SQL pairs functionality.
25-25
: Confirm references to updated router name
Ensure references to the old module or plural naming (semantics_preparations
) are removed throughout the codebase to avoid confusion or import errors.
32-32
: SQL pairs preparation route
This newly added route effectively wires up the SQL Pairs Preparation service. Looks good for handling the new functionality.
wren-ai-service/src/pipelines/generation/__init__.py (2)
1-16
: Comprehensive imports
All newly imported classes are succinctly grouped. This enhances discoverability and modular usage in downstream code. Ensure these imports do not introduce circular dependencies.
18-35
: Explicit public API
Defining __all__
clarifies the intended exports and prevents accidental usage of internal classes. Good practice!
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (4)
10-11
: Asynchronous test function
Using @pytest.mark.asyncio
is correct for testing async code. This allows for better concurrency handling while verifying pipeline behaviors.
21-24
: Multiple SQL pairs insertion
The SqlPair
instantiation ensures each pair has a unique identifier. This tests that deletion operations can handle multiple entries, which is a solid coverage scenario.
37-38
: Validation after first deletion
The assertion that the document count is still 2 provides a partial deletion scenario verification. This is good for ensuring partial or selective deletions operate as expected.
39-42
: Full deletion scenario
Asserting 0
documents remain ensures complete deletion functionality is verified. This thoroughly validates the end-to-end flow of the pipeline.
wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py (3)
16-23
: Leverage SqlPairsCleaner
effectively
The delete_sql_pairs
function simplifies the pipeline approach by delegating deletion to sql_pairs_cleaner
. This encapsulation is clean and maintainable.
29-40
: Pipeline constructor is coherent
Passing in a DocumentStoreProvider
and building a dictionary of components is a structured approach. Ensure that the dataset_name="sql_pairs"
remains consistent with any future expansions (e.g. multi-dataset support).
45-58
: Final pipeline execution
Using self._pipe.execute
within run
is flexible, allowing additional steps to be added if the pipeline grows. The observe
decorator aligns well with logging and monitoring.
wren-ai-service/tests/pytest/services/mocks.py (5)
3-3
: Consolidated imports look good.
The consolidated import from src.pipelines
helps keep the imports cleaner. No issues here.
16-16
: Ensure consistent naming usage.
HistoricalQuestionMock
now inherits from HistoricalQuestionRetrieval
. Verify all references to HistoricalQuestionMock
and historical_question.HistoricalQuestion
across the codebase are updated or removed as needed.
24-24
: Inheritance updated to IntentClassification
.
Confirm that the prior import references to intent_classification.IntentClassification
have been removed or updated.
34-34
: Refactored to use generation.SQLGeneration
.
Check all references to sql_generation.SQLGeneration
to ensure they are replaced with the new generation.SQLGeneration
.
55-55
: No issues with the renamed SQLSummary.
Inheriting from generation.SQLSummary
is consistent with the new import patterns.
wren-ai-service/tests/pytest/test_utils.py (1)
77-77
: user_id
parameter is set to None
.
This ensures consistency with the broader removal of user_id
from various request classes. Verify that any logic relying on user_id
has been updated accordingly.
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_preparation.py (7)
1-6
: Imports are modular and clear.
Imports for settings
, DocumentStoreProvider
, and SqlPairsPreparation
are well arranged and explicit.
9-19
: Test function naming and structure are descriptive.
The test name test_sql_pairs_preparation_saving_to_document_store
accurately conveys the test’s purpose. Setup steps (e.g., store creation) are straightforward.
20-29
: SqlPairsPreparation
instantiation is well-focused.
Parameters are clearly passed in. Validate that the pipeline components defined in settings.components["sql_pairs_preparation"]
exist and are correct.
31-37
: Document validations look correct.
All essential assertions (content, meta, sql_pair_id
, sql
) are present. This ensures robust coverage.
40-50
: Second test function effectively covers multiple IDs scenario.
This test replicates the pipeline run with different project IDs, ensuring coverage for varied contexts.
51-69
: Repeated run logic is consistent.
Running the pipeline multiple times is a valid approach to confirm isolation by id
.
70-79
: Filter-based validation is sound.
Filtering documents by project_id
checks data partitioning logic and ensures correct categorization per ID.
wren-ai-service/tests/pytest/services/test_ask_details.py (2)
6-6
: Broad import of generation
is consistent with ongoing refactoring.
No concerns here; it aligns with other updated references to the generation
module.
20-23
: Creation of SQLBreakdown
and SQLSummary
is aligned with the new prefix.
Ensure references to the older module or class names (sql_breakdown
/ sql_summary
) are removed or replaced throughout.
wren-ai-service/src/pipelines/retrieval/historical_question_retrieval.py (4)
13-13
: [Import usage confirmation] ScoreFilter appears to be correctly imported and used.
It’s good to see ScoreFilter
continue to be used here. Ensure no references to any removed or renamed classes remain.
107-107
: [Rename alignment] The new class name aligns well with its retrieval purpose.
Renaming HistoricalQuestion
→ HistoricalQuestionRetrieval
clarifies the pipeline’s intent. Confirm that all references to the old name have been removed throughout the codebase.
132-132
: [Log statement consistency] The logger statement is appropriate to the renamed class.
Good job updating the logger message to match the new HistoricalQuestionRetrieval
pipeline name.
147-147
: [Consistency check] The pipeline’s dry run reference is properly updated to the new class name.
This usage in dry_run_pipeline
looks correct.
wren-ai-service/src/web/v1/services/sql_explanation.py (1)
55-55
: [Naming practice] Changing the class name from SQLExplanationService
→ SqlExplanationService
improves consistency.
Ensure references to the previous name are fully removed, and confirm adherence to project naming conventions.
wren-ai-service/src/web/v1/services/sql_answer.py (1)
10-10
: [Trace metadata addition] The trace_metadata
decorator could provide valuable debugging insight.
No apparent issues with the addition. Confirm that the relevant logs or tracing dashboards can consume this metadata effectively.
wren-ai-service/src/web/v1/services/sql_regeneration.py (1)
78-78
: [Service rename] Transitioning from SQLRegenerationService
to SqlRegenerationService
is consistent with the style used elsewhere.
Verify that all imports or references reflect this updated name.
wren-ai-service/src/pipelines/generation/sql_generation.py (4)
85-85
: Parameter rename looks consistent.
Changing samples
to sql_samples
clarifies usage. Ensure any external references to this parameter have also been updated to avoid runtime errors.
93-93
: Good usage of the newly introduced parameter.
Forwarding sql_samples
to the prompt builder is correctly aligned with the updated parameter name.
168-168
: Addition of the sql_samples
parameter aligns with the prompt signature.
The run()
method signature now correctly matches the updated function parameters, improving readability.
178-178
: Properly passing sql_samples
.
Adding sql_samples
to the inputs ensures consistent data flow through the pipeline.
wren-ai-service/src/web/v1/routers/sql_pairs_preparation.py (3)
68-88
: POST /sql-pairs endpoint creation looks solid.
The endpoint’s asynchronous flow using BackgroundTasks
is a neat solution for offloading long-running tasks. Make sure to handle potential exceptions within the background task to keep the status consistent.
91-111
: DELETE /sql-pairs endpoint is well-structured.
The logic for generating a unique query ID and tracking deletion status is clearly implemented. Confirm that any exceptions in the deletion process also appropriately update the status.
114-123
: GET /sql-pairs/{sql_pairs_preparation_id} endpoint.
Retrieving the current status based on the unique ID completes a full asynchronous workflow. Looks good for monitoring and handling user queries about the process.
wren-ai-service/src/web/v1/services/chart.py (1)
9-9
: New import for tracing metadata.
The introduction of trace_metadata
aligns with the simplified removal of @async_timer
. Ensuring each request is traceable without extra timers makes the code cleaner.
wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py (3)
18-32
: Fixture for sql_pairs_preparation_service
.
The fixture sets up the environment and document store for SQL pairs. This is a clean approach to ensure each test starts with the correct, isolated context.
49-93
: Test for SQL pairs preparation.
The loop that checks status until "finished" or "failed" ensures asynchronous correctness. Great job verifying the final state of the document store after preparation.
95-164
: Test for SQL pairs deletion.
Reusing the same query_id
for deletion is a nice approach that demonstrates a real-world flow (prepare → delete). Checking the empty document store confirms the logic is complete.
wren-ai-service/src/web/v1/services/ask_details.py (2)
10-10
: Ensure import consistency and usage clarity.
The added import of trace_metadata
appears consistent with the usage in the ask_details
method. No issues found.
Line range hint 25-36
: Confirm removal of user_id
from AskDetailsRequest
is fully handled.
We see that user_id
is no longer part of the class. Please verify that all references to user_id
have been removed or refactored accordingly elsewhere, so that downstream functionality is not impacted.
wren-ai-service/eval/evaluation.py (1)
121-122
: Validate fallback path when keys are absent.
Switching to meta.get("session_id")
and meta.get("user_id")
prevents potential KeyError
. Ensure that if these values are absent and become None
, downstream logic can safely handle it or provide a suitable default.
wren-ai-service/tests/pytest/services/test_ask.py (2)
37-52
: Refactored pipeline references look good.
Using generation
and retrieval
modules consistently improves readability and aligns naming across components. Good job!
163-163
: Renamed mock aligns with real service naming.
Updating historical_question
to historical_question_retrieval
maintains consistency. Looks good!
wren-ai-service/tests/pytest/test_main.py (1)
26-26
: Test name correction is fine—ensure references are updated if needed.
Renaming from test_semantics_preparations
to test_semantics_preparation
clarifies the intent. Verify that other references, if any, have been updated consistently.
wren-ai-service/src/web/v1/services/sql_expansion.py (1)
9-9
: Removal of async_timer
looks consistent.
No issues spotted with removing the @async_timer
. The remaining @observe
and @trace_metadata
decorators ensure the method still has observability and trace functionality. This change also aligns with other timer removals throughout the PR.
wren-ai-service/eval/pipelines.py (2)
145-146
: Graceful fallback on missing session and user data.
Switching to .get("session_id")
and .get("user_id")
mitigates potential KeyError exceptions when these fields are missing. This improves resilience.
162-163
: Consistent session and user ID handling.
Similar improvement here—replacing direct dictionary key access with .get(...)
ensures safer metadata retrieval.
wren-ai-service/src/web/v1/services/ask.py (4)
10-10
: Trace metadata import usage is correct.
No further action needed; consistent with how other services incorporate tracing.
245-247
: Renamed pipeline key from historical_question
to historical_question_retrieval
.
This aligns with the updated naming convention in configuration files. No functional issues apparent.
286-286
: sql_samples
parameter for followup_sql_generation
.
Accepting sql_samples
adds flexibility, but confirm that both the retrieval and generation pipeline handle the relevant data shape and potential empty arrays.
297-297
: sql_samples
parameter for sql_generation
.
The same suggestion applies here—validate or test thoroughly to ensure the new parameter is consistently consumed and does not introduce unexpected behaviors.
wren-ai-service/tests/data/config.test.yaml (1)
45-45
: Consistent pipeline rename in config.
Renaming historical_question
to historical_question_retrieval
keeps the configuration aligned with code changes. This helps avoid runtime mismatches.
docker/config.example.yaml (1)
107-117
: LGTM: SQL pairs pipeline configuration.
The new SQL pairs pipelines are properly configured with the required components (document store, embedder, and LLM where needed).
🧰 Tools
🪛 yamllint (1.35.1)
[error] 113-113: trailing spaces
(trailing-spaces)
wren-ai-service/src/pipelines/indexing/__init__.py (1)
119-125
: LGTM: Export list updated correctly.
The __all__
list is properly updated to include the new SQL pairs components.
wren-ai-service/tools/config/config.example.yaml (1)
125-135
: LGTM: SQL pairs pipeline configuration.
The new SQL pairs pipelines are properly configured with the required components (document store, embedder, and LLM where needed).
🧰 Tools
🪛 yamllint (1.35.1)
[error] 131-131: trailing spaces
(trailing-spaces)
wren-ai-service/tools/config/config.full.yaml (1)
144-147
: Consider renaming for consistency with existing patterns
The pipeline name sql_pairs_preparation
could be renamed to sql_pairs_indexing
to maintain consistency with other indexing operations like db_schema_indexing
and historical_question_indexing
.
wren-ai-service/src/pipelines/generation/chart_adjustment.py (1)
175-175
: LGTM: Clean simplification of post-processor invocation
The change to use direct class invocation instead of the run
method simplifies the code while maintaining the same functionality.
# POST /v1/sql-pairs | ||
class SqlPair(BaseModel): | ||
sql: str | ||
id: str | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add validation or docstrings for SqlPair.
Although it’s straightforward, clarifying constraints on the sql
field (like maximum length or allowed characters) can prevent upstream injection issues.
@observe(name="Delete SQL Pairs") | ||
@trace_metadata | ||
async def delete_sql_pairs( | ||
self, | ||
delete_sql_pairs_request: DeleteSqlPairsRequest, | ||
**kwargs, | ||
): | ||
results = { | ||
"metadata": { | ||
"error_type": "", | ||
"error_message": "", | ||
}, | ||
} | ||
|
||
try: | ||
await self._pipelines["sql_pairs_deletion"].run( | ||
sql_pair_ids=delete_sql_pairs_request.ids, | ||
id=delete_sql_pairs_request.project_id, | ||
) | ||
|
||
self._prepare_sql_pairs_statuses[ | ||
delete_sql_pairs_request.query_id | ||
] = SqlPairsPreparationStatusResponse( | ||
status="finished", | ||
) | ||
except Exception as e: | ||
logger.exception(f"Failed to delete SQL pairs: {e}") | ||
|
||
self._prepare_sql_pairs_statuses[ | ||
delete_sql_pairs_request.query_id | ||
] = SqlPairsPreparationStatusResponse( | ||
status="failed", | ||
error=SqlPairsPreparationStatusResponse.SqlPairsPreparationError( | ||
code="OTHERS", | ||
message=f"Failed to delete SQL pairs: {e}", | ||
), | ||
) | ||
|
||
results["metadata"]["error_type"] = "DELETION_FAILED" | ||
results["metadata"]["error_message"] = str(e) | ||
|
||
return results | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Avoid silent data corruption in delete_sql_pairs.
If an exception occurs mid-deletion, some pairs might be partially removed. Consider introducing a rollback approach or partial-completion logs for better transparency.
|
||
sql_intention_generation_system_prompt = """ | ||
### TASK ### | ||
|
||
You are a data analyst great at generating the concise and readable summary of a SQL query. | ||
|
||
### INSTRUCTIONS ### | ||
|
||
- Summary should be concise and readable. | ||
- Summary should be no longer than 20 words. | ||
- Don't rephrase keywords in the SQL query, just use them as they are. | ||
|
||
### OUTPUT ### | ||
|
||
You need to output a JSON object as following: | ||
{ | ||
"intention": "<CONCISE_AND_READABLE_SUMMARY_STRING>" | ||
} | ||
""" | ||
|
||
sql_intention_generation_user_prompt_template = """ | ||
### INPUT ### | ||
SQL: {{sql}} | ||
|
||
Please think step by step | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review system prompt for potential PII exposure.
The system prompt instructs on summarizing queries. If queries can contain sensitive data, ensure that the prompt or the generation doesn't expose private details externally, especially when using an external LLM.
@observe() | ||
def post_process( | ||
generate_sql_intention: List[dict], | ||
sql_pairs: List[SqlPair], | ||
) -> List[Dict[str, Any]]: | ||
intentions = [ | ||
orjson.loads(result["replies"][0])["intention"] | ||
for result in generate_sql_intention | ||
] | ||
|
||
return [ | ||
{"id": sql_pair.id, "sql": sql_pair.sql, "intention": intention} | ||
for sql_pair, intention in zip(sql_pairs, intentions) | ||
] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heavy reliance on user content in orjson decoding.
Relying on replies
content from external calls can lead to decode errors. Confirm robust error handling for malformed or incomplete JSON.
@observe(capture_input=False) | ||
async def write_sql_pairs( | ||
embed_sql_pairs: Dict[str, Any], | ||
sql_pairs_writer: AsyncDocumentWriter, | ||
) -> None: | ||
return await sql_pairs_writer.run(documents=embed_sql_pairs["documents"]) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Disable partial writes if a portion fails.
write_sql_pairs
does not appear to handle exceptions or partial writes. Consider transaction-like mechanisms: all or nothing, or log partial success.
- name: chart_generation | ||
llm: litellm_llm.gpt-4o-mini-2024-07-18 | ||
- name: chart_adjustment | ||
llm: litellm_llm.gpt-4o-mini-2024-07-18 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove duplicate pipeline entries.
The chart_generation
and chart_adjustment
pipelines are duplicated. Remove the duplicate entries as they are identical to the existing ones.
- - name: chart_generation
- llm: litellm_llm.gpt-4o-mini-2024-07-18
- - name: chart_adjustment
- llm: litellm_llm.gpt-4o-mini-2024-07-18
- name: chart_generation | ||
llm: litellm_llm.gpt-4o-mini-2024-07-18 | ||
- name: chart_adjustment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove duplicate chart pipeline entries
The chart_generation
and chart_adjustment
entries appear twice in the pipeline configuration. Remove the duplicate entries to avoid confusion.
- name: preprocess_sql_data | ||
llm: litellm_llm.gpt-4o-mini-2024-07-18 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove duplicate pipeline entry
The preprocess_sql_data
pipeline entry appears twice in the configuration. Remove the duplicate entry to maintain a clean configuration.
<<<<<<< HEAD | ||
======= | ||
- name: preprocess_sql_data | ||
llm: litellm_llm.gpt-4o-mini-2024-07-18 | ||
>>>>>>> main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolve merge conflict markers
The file contains unresolved merge conflict markers. Please resolve the conflict and remove the markers.
-<<<<<<< HEAD
-=======
- - name: preprocess_sql_data
- llm: litellm_llm.gpt-4o-mini-2024-07-18
->>>>>>> main
🧰 Tools
🪛 yamllint (1.35.1)
[error] 132-132: syntax error: could not find expected ':'
(syntax)
async_timer
,timer
historical_question
tohistorical_question_retrieval
user_id
, since we don't use it in apisSummary by CodeRabbit
New Features
sql_pairs_preparation
,sql_pairs_deletion
, andsql_pairs_retrieval
.Bug Fixes
Documentation
Chores